17 February 2017, University of York
why Research Data Management?
Technology is increasing data collection throughput
Computing power allows more sophisticated analyses
For every discipline "X" there is a "Computational X"
Powerful statistical tools are freely available
The published results of many scientific experiments are difficult or impossible to replicate on subsequent investigation, either by independent researchers or by the original researchers themselves.
At the very least, we need to aim towards reproducing results from code and data.
I think I have seen the worse #otherpeoplesdata yet. There is not a SINGLE thing done right. NOT ONE.
— Timothée Poisot (@tpoi) January 18, 2017
"It's on github. It's documented. It's tested. No need to ask permission, just…take it away and let me know if you run into any problems." https://t.co/gZFWQ2ROQG
— Titus Brown (@ctitusbrown) January 18, 2017
raw data ➙ published results: focus on reproducibility, underpinned by more fundamental qualities:
Critics say it's constraining: wasted time for which researchers won't be rewarded.
origins in open source software communities. see SCIENCE X PYTHON
aim to create secure datasets that are easy to use and REUSE
@tomjwebb "Remember what you did in your Master program? Yeah? Well, don't do that anymore."
— Nate Hough-Snee (@NHoughSnee) January 16, 2015
Act as though every short term study will become a long term one @tomjwebb. Needs to be reproducible in 3, 20, 100 yrs
— oceans initiative (@oceansresearch) January 16, 2015
Start early. Make an RDM plan before collecting data.
Think about what technologies to use
This guide for early career researchers explains what data and data management are, and provides advice and examples of best practices in data management, including case studies from researchers currently working in ecology and evolution.
We describe nine simple ways to make it easy to reuse the data that you share and also make it easier to work with it yourself. Our recommendations focus on making your data understandable, easy to analyze, and readily available to the wider community of scientists.
Most university libraries have assistants dedicated to RDM:
@tomjwebb @ScientificData Talk to their librarian for data management strategies #datainfolit
— Yasmeen Shorish (@yasmeen_azadi) January 16, 2015
@tomjwebb at minimum conform units, data fields to existing public databases, traits=TRY, collections=DarwinCore, plots=SALVIASVegBank
— Brian J. Enquist (@bjenquist) January 16, 2015
@tomjwebb record every detail about how/where/why it is collected
— Sal Keith (@Sal_Keith) January 16, 2015
@tomjwebb stay away from excel at all costs?
— Timothée Poisot (@tpoi) January 16, 2015
Make raw data files read only.

@tomjwebb @tpoi excel is fine for data entry. Just save in plain text format like csv. Some additional tips: pic.twitter.com/8fUv9PyVjC
— Jaime Ashander (@jaimedash) January 16, 2015
@jaimedash just don’t let excel anywhere near dates or times. @tomjwebb @tpoi @larysar
— Dave Harris (@davidjayharris) January 16, 2015
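One way to keep dates safe from spreadsheet mangling (a minimal R sketch; the column names are made up for illustration): store dates as ISO 8601 text (YYYY-MM-DD) in the csv and parse them explicitly on import, rather than trusting auto-detection.

```r
# Dates stored as ISO 8601 text survive any spreadsheet round-trip;
# parse them explicitly on import instead of relying on auto-detection.
raw <- "site,date,count
A,2015-01-16,12
B,2015-01-17,7"

df <- read.csv(text = raw, stringsAsFactors = FALSE)
df$date <- as.Date(df$date, format = "%Y-%m-%d")

str(df$date)  # now a proper Date column, not a mangled serial number
```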
God bless the UK Gov’t Statistical Service Good Practice Team. Sane spreadsheet advice. https://t.co/MLQMvnBzkV pic.twitter.com/OD7GzfVKLW
— Jenny Bryan (@JennyBryan) November 27, 2014
@tomjwebb databases? @swcarpentry has a good course on SQLite
— Timothée Poisot (@tpoi) January 16, 2015
@tomjwebb @tpoi if the data are moderately complex, or involve multiple people, best to set up a database with well designed entry form 1/2
— Luca Borger (@lucaborger) January 16, 2015
@tomjwebb Entering via a database management system (e.g., Access, Filemaker) can make entry easier & help prevent data entry errors @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@tomjwebb it also prevents a lot of different bad practices. It is possible to do some of this in Excel. @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@ethanwhite +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on @tomjwebb @tpoi
— Gavin Simpson (@ucfagls) January 16, 2015
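A database entry form gives you these checks at entry time; if your data live in plain files you can approximate the same constraints in R after import. A sketch with hypothetical column names and allowed values:

```r
# Approximate database-style constraints in R after import:
# enforce types, ranges and allowed values, failing loudly on bad rows.
df <- data.frame(species = c("Abies alba", "Betula nigra"),
                 count   = c(3L, 5L),
                 habitat = c("forest", "grassland"),
                 stringsAsFactors = FALSE)

allowed_habitats <- c("forest", "grassland", "wetland")

stopifnot(is.integer(df$count),                  # data type check
          all(df$count >= 0),                    # range check
          all(df$habitat %in% allowed_habitats)) # "options from selection"
```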
@tomjwebb It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?
— Paul Swaddle (@paul_swaddle) January 16, 2015
A .csv copy would need to be saved:
write.csv(df, file = paste0("variable_", res, "_", month, ".csv"), row.names = FALSE)
df <- read.csv(paste0("variable_", res, "_", month, ".csv"))
Code missing values consistently: NA or NULL are good options. Avoid numbers like -999 or 0.

read.csv() utilities:
na.strings: character vector of values to be coded as missing and replaced with NA
strip.white: logical; if TRUE, strips leading and trailing white space from unquoted character fields
blank.lines.skip: logical; if TRUE, blank lines in the input are ignored
fileEncoding: if you're getting funny characters, you probably need to specify the correct encoding

read.csv(file, na.strings = c("NA", "-999"), strip.white = TRUE,
         blank.lines.skip = TRUE, fileEncoding = "mac")
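For example, the effect of na.strings and strip.white (a self-contained sketch using inline text in place of a file):

```r
# -999 sentinel values become NA, and stray whitespace is stripped,
# so the column is read as clean numbers.
raw <- "site,depth
A,12
B,-999
C, 7"

df <- read.csv(text = raw, na.strings = c("NA", "-999"), strip.white = TRUE)
df$depth  # 12, NA, 7
```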
View(df)
summary(df), head(df) and str(df) are useful for inspecting the data after import.

What information would other users require to combine your data with theirs?
temporal (time of day, day, month, year, season)
geography (lat, lon)
species name; authority / source

@tomjwebb don't, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb @srsupp Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.
— Desiree Narango (@DLNarango) January 16, 2015
Do not manually edit raw data
Keep a clean pipeline of data processing from raw to analytical.
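A minimal scripted pipeline in that spirit (the file name and cleaning steps are hypothetical; inline text stands in for the raw file to keep the sketch self-contained): raw data is only ever read, every cleaning step is recorded in code, and the analytical copy is written to a separate file.

```r
# In practice `raw` would come from a read-only file under data/raw/;
# inline text keeps this sketch self-contained.
raw <- read.csv(text = "species,count
Abies Alba ,3
,2
betula nigra,5", stringsAsFactors = FALSE)

clean <- raw[raw$species != "", ]                # drop incomplete records
clean$species <- trimws(tolower(clean$species))  # standardise names

out <- file.path(tempdir(), "survey_clean.csv")
write.csv(clean, out, row.names = FALSE)         # analytical copy; raw untouched
```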

master copy of files
@tomjwebb Back it up
— Ben Bond-Lamberty (@BenBondLamberty) January 16, 2015